Method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables

ABSTRACT

A method and system for developing data integration applications with reusable semantic identifiers to represent application data sources and variables. Methods include receiving a set of physical data identifiers that identify physical data fields, associating semantic names with these fields, and executing rules expressed in terms of these semantic names.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. §119(e) of the following application, the contents of which are incorporated by reference herein:

-   U.S. Provisional Application No. 61/052,548, entitled Method and System for Structuring and Performing Data Integration and Conversion Using a Semantic Model, filed on May 12, 2008.

This application is related to the following applications, filed on an even date herewith:

-   U.S. patent application Ser. No. [TBA], entitled Method and System for Debugging Data Integration Applications with Reusable Synthetic Data Values;
-   U.S. patent application Ser. No. [TBA], entitled Method and System for Managing the Development of Data Integration Projects to Facilitate Project Development and Analysis Thereof;
-   U.S. patent application Ser. No. [TBA], entitled Method and System for Developing Data Integration Applications with Reusable Functional Rules that are Managed According to their Output Variables;
-   U.S. patent application Ser. No. [TBA], entitled Method and System for Executing a Data Integration Application Using Executable Units that Operate Independently of Each Other.

BACKGROUND

1. Field of the Invention

The present invention relates to data integration applications, and, more specifically, to abstracting data used in a data integration application by using semantic names.

2. Discussion of Related Art

When a database system is upgraded or replaced, the existing data must be transferred to the new system. This process, called data migration, is becoming increasingly expensive as database systems become larger and more complex. Planning and executing a data migration consumes valuable resources and can often result in considerable downtime. Also, mistakes in data migration can lead to data corruption, which is not an acceptable risk for institutions that handle sensitive data.

These difficulties are compounded when it is necessary to combine data from several different data storage systems, a process known as data integration. Data integration applications must reconcile data from several potentially incompatible storage systems, convert these data into a unified format, and load the new data into the target database. These are complicated tasks, and they require careful planning and detailed knowledge of the structure of the source databases. Errors in data integration are common, difficult to diagnose, and expensive to fix.

In the past, data integration applications have typically been developed for a specific database upgrade or merger task, and they become useless after this task is complete. This ad hoc approach makes it impossible to reuse program code, substantially increasing the cost of development. Also, it tends to produce applications that are written from scratch and not comprehensively tested, increasing the likelihood of data corruption.

In light of these problems, there exists a need for an improved method of developing database applications that minimizes the costs and risks associated with data migration and data integration.

SUMMARY OF THE INVENTION

This invention provides methods and systems for developing data integration applications with reusable semantic identifiers to represent application data sources and variables.

Under one aspect of the invention, a method is presented that includes receiving a set of physical data identifiers that specify fields of physical data sources, storing in a database a set of semantic names for use in defining data integration applications, defining, in terms of the received semantic names, a data integration application comprising functional rules to extract, transform, and store data, and executing these rules by replacing each of the semantic names with data from the specified field of the physical data source.

Under another aspect of the invention, the method further includes automatically converting the input data values from one datatype to another as required by the functional operators.

Under another aspect of the invention, the method further includes providing a set of suggested semantic names and associating one of the suggested semantic names with a field of a physical data source.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a dataflow diagram that illustrates the operation of an example application, according to certain embodiments of the invention.

FIG. 2 is a UML package diagram that depicts the coarse dependencies and relationships among the basic components of the semantic data integration system, according to certain embodiments of the invention.

FIG. 3 is a relationship diagram that illustrates the relationships among the various types of project objects stored in the repository, according to certain embodiments of the invention.

FIG. 4 is a UML state diagram that depicts the relationships among the various project stages, according to certain embodiments of the invention.

FIG. 4.1 is a flowchart that depicts the various stages of project development, according to certain embodiments of the invention.

FIG. 5 is a relationship diagram that depicts the components of the semantic model, according to certain embodiments of the invention.

FIG. 6 is a relationship diagram that depicts the structure of a semantic data integration function within an application, according to certain embodiments of the invention.

FIG. 7 is a relationship diagram that depicts an output-oriented rule definition, according to certain embodiments of the invention.

FIG. 8 is a relationship diagram that depicts the use of output-oriented rules in a function, according to certain embodiments of the invention.

FIG. 9 is a relationship diagram that depicts the preferred embodiment of function-level synthetic debugging and testing for semantic data integration.

FIG. 10 is a diagram that illustrates the separation of development roles in a semantic data integration project, according to certain embodiments of the invention.

FIG. 11 is a control flow relationship diagram that illustrates the control flow within the data integration engine when a sample data integration application is executed on a single host, according to certain embodiments of the invention.

FIG. 12 is a data flow relationship diagram that illustrates the flow of data within the data integration engine when a sample data integration application is executed on a single host, according to certain embodiments of the invention.

FIG. 13 is a modified UML collaboration diagram that illustrates the startup sequence that results when a sample data integration application is executed in a distributed environment, according to certain embodiments of the invention.

FIG. 14 is a modified UML collaboration diagram that illustrates the process of distributed shared memory replication when a sample data integration application is executed in a distributed environment, according to certain embodiments of the invention.

FIG. 15 is a data flow relationship diagram that illustrates the flow of data in the data integration engine when a sample data integration application is run in a distributed environment, according to certain embodiments of the invention.

FIG. 16 is a diagram that depicts the various components of a data integration engine, according to certain embodiments of the invention.

DETAILED DESCRIPTION

I. Introduction

Preferred embodiments of the present invention provide semantic systems and methods for developing, deploying, running, maintaining, and analyzing data integration applications and environments.

Those data integration applications that are relevant to the techniques described herein are broadly described by the class of applications concerned with the movement and transformation of data between systems and commonly represented by, but not limited to: data warehousing or ETL (extract-transform-load) applications, data profiling and data quality applications, and data migration applications that are concerned with moving data from old to new systems.

Data integration applications developed and maintained using these techniques are developed using a semantic model. At its core, a semantic development model enables an application to be partially or fully developed without knowledge of the physical data identities (locations, structures, names, types, etc.) being integrated. Physical identities are present in the system but they are abstracted with semantic identities. There are several advantages to this approach: changes to physical data locations or structures do not automatically prevent the application developer from accomplishing real work; a high or intimate level of knowledge of the data being integrated is not required; business rules and other application logic developed using semantic identities can easily be reused and tested from application to application regardless of the physicality of the underlying data structures; and the costs of data mapping exercises can be significantly reduced over time as the system learns about fields that are semantically equivalent.

A data integration application developed using the techniques described herein is preferably stored in a common repository or database. This database includes a semantic metadata model that correlates physical locations and datatypes, representing the source and target data, with semantic identities. The database also includes representations of business rules that are defined using semantic identities instead of physical identities. The business rules and the semantic model are stored and maintained separately. Thus, application developers do not need to know the physical locations or datatypes of the source data in order to implement data transformation functions.
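The separation between the semantic model and the business rules can be illustrated with a minimal sketch. Everything named here (the `PhysicalField` class, the URI, the physical field names, and the `bind` helper) is a hypothetical stand-in chosen for illustration and is not taken from the described system; the point is only that the rule refers to semantic names, while the mapping to physical fields is kept elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalField:
    """Hypothetical physical identity: where a value lives and its datatype."""
    source_uri: str
    field_name: str
    datatype: str


# Hypothetical semantic model: semantic name -> physical identity.
# It is maintained separately from any business rule.
SEMANTIC_MODEL = {
    "first_name": PhysicalField("file:///data/accounts.vsam", "CUST-FNAME", "char(20)"),
    "last_name": PhysicalField("file:///data/accounts.vsam", "CUST-LNAME", "char(30)"),
}


def bind(physical_row, model):
    """Re-key a physical row by semantic name using the semantic model."""
    return {name: physical_row[pf.field_name] for name, pf in model.items()}


def full_name_rule(record):
    """A business rule written only in terms of semantic names."""
    return record["first_name"] + " " + record["last_name"]


physical_row = {"CUST-FNAME": "Ada", "CUST-LNAME": "Lovelace"}
semantic_row = bind(physical_row, SEMANTIC_MODEL)
result = full_name_rule(semantic_row)
```

If the physical field `CUST-FNAME` were renamed or moved, only the entry in the mapping would change; the rule itself would remain untouched.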

The repository is preferably maintained and updated using a hybrid versioning system for data integration projects. This system provides version control for project artifacts, and also provides a fine-grained locking mechanism that controls the ability to edit and execute a project in various ways according to the project's current stage in the development process. The hybrid versioning system also interfaces with a relational database, which can be used to efficiently calculate and report project metrics.

The system's data integration engine executes data integration applications using a parallel, distributed architecture. Parallelism is achieved where possible by leveraging multiple redundant data sources and distributing the execution of the application across multiple hosts. The techniques disclosed herein are scalable to execution environments that comprise a large number of hosts.

FIG. 16 is a diagram that depicts the various components of a data integration system, according to certain embodiments of the invention. The functional logic of the data integration is performed by a host computer [1601], which contains volatile memory [1602], a persistent storage device such as a hard drive [1608], a processor [1603], and a network interface [1604]. Using the network interface, the computer can interact with databases [1605, 1606]. During the execution of the data integration application, the computer extracts data from some of these databases, transforms it according to programmatic data transformation rules, and loads the transformed data into other databases. Though FIG. 16 illustrates a system in which the computer is separate from the various databases, some or all of the databases may be housed within the host computer, eliminating the need for a network interface. The data transformation rules may be executed on a single host, as shown in FIG. 16, or they may be distributed across multiple hosts.

The host computer shown in FIG. 16 may also serve as a development workstation. Development workstations are preferably connected to a graphical display device [1607], and to input devices such as a mouse [1609] and a keyboard [1610]. One preferred embodiment of the present invention includes a graphical development environment that displays a data integration application as a diagram, in which the data transformation rules are represented by shapes and the flow of data between rules is represented by arrows. This visual interface allows developers to create and manipulate data integration applications at a more intuitive level than, for example, a text-based interface. However, the techniques described herein may also be applied to non-graphical development environments.

Each of these features is discussed in more detail in the sections that follow.

II. Project Model

FIG. 1 is a dataflow diagram that illustrates the operation of an example data integration application that will be referenced in following sections. An application organizes the execution of a set of functions, which perform individual units of work, and the flow of data between those functions. The sample application [101] has three functions, represented by boxes, and data flow between those functions, represented by arrows.

In this example, the Read-Data function [102] reads monthly transactional bank account data from a VSAM file [105] and outputs that data for use as input in the next function. The Transform-Data function [103] receives its input from the Read-Data function. Its transformation logic aggregates those bank account transactions to compute end-of-month status for each account, and outputs the end-of-month status for use as input to the next function. Finally, the Write-Data function [104] receives the end-of-month status from the Transform-Data function and writes that data to a flat RDBMS table [106] which will be used to produce monthly snapshot reports for each bank account.

Development of the sample application [101] begins when a project is created for managing the application's development and deployment. Also, a semantic model, separate from the project, is used to store and maintain the association between physical identities (i.e. the physical locations and datatypes of the project's source data) and semantic identities. If no semantic models have been created for the relevant data, a new semantic model is initialized. If a semantic model for the project's source data has already been created (e.g. by a prior project, or through ongoing maintenance) then the new project may use the existing semantic model; thus, it is not necessary to create a new semantic model for each new project.

After the creation of the project, project-specific artifacts may be created. These artifacts, discussed in more detail below, are tested and checked in to the repository. The project entity also contains an identifier that represents the current stage of project development. At each stage of the project the application is executed by the data integration engine in a stage-specific environment. Eventually the application is considered complete and the project, and all applications contained within the project, are moved into production.

FIG. 2 is a UML package diagram that depicts the coarse dependencies and relationships among the basic components of the semantic data integration system, according to certain embodiments of the invention.

The system repository [201] is a database used by the system's tools and engine. It is centrally deployed in order to capture and share system objects across applications, and to provide visibility into data integration projects, data usage, application performance, and various metrics. The repository consists of three high-level subsystems: a relational database, a source control system, and business logic to implement functionality such as creating a project, publishing, staging, etc. The database and source control subsystems are provided using conventional third party technologies. The business logic is implemented with a J2EE application but could easily be .NET or some other web-application technology. The various system tools (semantic maintenance tool [204], project maintenance tool [205], and development tool [206]) connect to these repository subsystems directly as required.

The primary contents of the repository include: the semantic model [202], which captures metadata that describes the contextual or semantic identities in an enterprise, the actual or physical identities in an enterprise, and various relationships between the semantic identities and physical identities; and projects [203], which are system objects that group related artifacts necessary for defining and deploying a data integration application.

The repository is manipulated by system tools including: the semantic maintenance tool [204], which maintains the semantic model; the project maintenance tool [205], which maintains projects and associated data and generates reports to various levels of detail across the system; the development tool [206], which is used to develop data integration applications; and the integration engine [207], which executes applications using a parallel, distributed system, computes runtime statistics, and stores these statistics in the repository. Additional description of these components and how they interact is included below.

FIG. 3 is a relationship diagram that illustrates the relationships among the various types of project objects stored in the repository, according to certain embodiments of the invention. A relationship diagram is a modified UML class diagram that conveys the relationships between objects or components. The object or component is labeled in a rectangular box and a relationship to another object or component is represented with a labeled arrow from one box to the other. The relationship reads from arrow begin to arrow end (the end of the line with the actual arrow). Like UML class diagrams, these relationship diagrams allow for containment to be expressed with an arrow or by placing the child object visually within the parent object. In some cases the rectangle for an object or component is dashed, indicating that it is not an actual object but is really a conceptual group (like an abstract class) for the objects shown therein.

A project [203.1] as depicted is one of the many projects [203] shown in FIG. 2. A project is a system object that is used to organize and manage the development and deployment of data integration applications through various stages. A project's stage [301] specifies the current state of the project within the development and deployment process. The various stages that may be associated with a project are described in more detail below. A project's measures [308] are metrics or statistics relevant to the project that are collected after the project is created.

A project's artifacts [302] define the project's applications and supporting metadata. These are preferably captured as XML files that are versioned using the standard source control functionality implemented by the development tool. Project artifacts are accumulated after inception and include: drawings [303] which are visual descriptions of the functions, transformations, and data flow for one or more applications; data access definitions [309] which individually describe a set of parallel access paths (expressed as URIs) to physical data resources; semantic records [304] which primarily describe the data structures for one or more applications; documentation [305] for the project and its artifacts; and other artifacts [306] that may be created during the life of the project.

The project model [307] is a relational model that represents the project, its artifacts, and other data and metadata for the project. The project model and project measures provide a basis for introspection and analysis across all projects in the system.

A project's stage also controls where the project can be run. In an environment where this system is deployed, individual machines where the engine can run are designated to allow execution only for a specific stage. For example, host machine SYSTEST288 may be designated as a system testing machine. Any instance of the system's engine that is deployed on SYSTEST288 will only allow projects in the “system testing” stage [301.2] to run. This additional level of control is compatible with how IT departments prefer to isolate business processes by hardware.
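The host-level gating described above might be sketched as follows. The `HOST_STAGES` table and the `may_run` function are hypothetical names introduced for illustration only; the text does not specify how host designations are stored or checked.

```python
# Hypothetical designation table: each engine host allows exactly one stage.
# SYSTEST288 is the example host named in the text; the table itself is assumed.
HOST_STAGES = {
    "SYSTEST288": "system testing",
}


def may_run(host, project_stage, host_stages=HOST_STAGES):
    """Return True only if the host is designated for the project's current stage."""
    return host_stages.get(host) == project_stage
```

Under this sketch, `may_run("SYSTEST288", "system testing")` is true, while a project still in the development stage, or any project on an undesignated host, is refused.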

For example, the simple application described above [101] might be developed as part of a new project implemented by the IT department of a financial institution that wishes to gather and analyze additional monthly status for individual bank accounts. The project [203.1] would be created by a project manager using the project maintenance tool [205] and the project would begin in the development stage [301.1] (described below). Preliminary project artifacts [302] such as semantic records [304] (described below) would then be defined and added to the project by a data architect or equivalent. A developer would then use these artifacts to create drawings [303] which define the transformation logic of the application [101]. As the application is developed and tested, the project would move through various stages (see FIG. 4) until it is finally placed into production. The project measures [308] would allow the project manager and others to analyze the project using relational reporting and analysis techniques in order to improve the company's data integration and business processes.

The development tool [206] is conventional, and similar in layout and purpose to many other existing graphical programming tools that may be used for defining workflow, process flow, or data integrations. Examples of such tools include Microsoft BizTalk Orchestrator, Vignette Business Integration Studio, and FileNet Process Designer, among others. The primary workspace consists of a palette of symbols corresponding to various functions that may be performed by the engine [207], and a canvas area for creating a drawing. Prior to creating drawings for a project, the user is given permission to work on that project by another user, typically a project manager, of the project maintenance tool [205]. These permissions are stored in the repository [201].

From within the development tool, which is installed on the local computer of the developer using the tool, the developer is allowed to “check out” a snapshot of the artifacts for any project for which the user has permission (as defined in the repository). The project artifacts must include any semantic records [304] and data access definitions [309] that the developer will need to build the drawing; these requisite artifacts were previously defined by another user, typically a data architect, using the project maintenance tool [205].

Within the development tool, the user creates a drawing. Using our sample application for descriptive purposes, this process may work as follows:

After project checkout (defined above), the user drags functions from the palette to the canvas area. In the case of our sample application, the user would drag three different functions from the palette: one to read data from a file (necessary for the Read-Data function [102]), one to transform data (necessary for the Transform-Data function [103]), and one to write data to a table (necessary for the Write-Data function [104]). The user would then visually “connect” the functions in the drawing according to the direction of the data flow for the sample application. Each function has properties that must be configured to define its specific behavior for the engine.

The user will then edit these properties with standard property editor user interfaces. The properties specified by the user for the Read-Data function include the name of its output semantic record [305.2] which specifies the data being read from the file, and the name of a data access definition [309] which specifies one or more parallel access paths (expressed as URIs) to the file. The properties specified by the user for the Write-Data function include the name of its input semantic record [305.1] which specifies the data being written to the table, and the name of a data access definition [309] which specifies one or more parallel access paths (expressed as URIs) to the table.

Because the user connected the Read-Data function to the Transform-Data function, the input semantic record [305.1] for the Transform-Data function [103] is automatically derived from the output semantic record of the Read-Data function [102]. Likewise, because the user connected the Write-Data function to the Transform-Data function, the output semantic record [305.2] for the Transform-Data function [103] is automatically derived from the input semantic record of the Write-Data function [104]. The user will further configure the Transform-Data function in the drawing by specifying its transformation logic in a transformation editor. The semantic identities of the input semantic record and output semantic record are presented to the user in this editor. In the transformation editor, the user provides logic that specifies how output values in the semantic record are calculated. When the values are a direct move from input to output, a simple statement such as “output=input” can be used to automatically move data from input to output for any like-named semantic identities. When more specific rules are needed for an output field, they can be specified directly in the logic, for example:

-   output.full_name=string.concatenate(input.first_name, “ ”, input.last_name)
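The two kinds of transformation logic described above, a bulk move of like-named semantic identities followed by a field-specific rule, can be sketched in Python. The function names and the sample record are hypothetical, and the sketch assumes that the name parts are joined with a single space; it is not the transformation language used by the development tool.

```python
def move_like_named(input_record, output_fields):
    """Approximate 'output=input': copy every field whose semantic name
    appears in both the input record and the output record."""
    return {name: input_record[name] for name in output_fields if name in input_record}


def transform(input_record, output_fields):
    """Apply the bulk move, then one field-specific rule for full_name."""
    output = move_like_named(input_record, output_fields)
    # Field-specific rule; a single-space separator is an assumption.
    output["full_name"] = " ".join([input_record["first_name"], input_record["last_name"]])
    return output


sample_input = {"account_id": 7, "first_name": "Ada", "last_name": "Lovelace"}
sample_output = transform(sample_input, output_fields=("account_id", "full_name"))
```

Here `account_id` is carried over by the like-named move, while `full_name` exists only in the output record and is therefore computed by its own rule.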

As the user builds the drawing and configures each function, the development tool will validate the drawing according to the rules of the engine and present warnings and errors to the user so that the user may correct the drawing. The user has the ability to synthetically debug (FIG. 9) the Transform-Data function from within the development tool. The user may also execute the drawing from the development tool; in this scenario the execution may be performed by a local instance of the engine which is installed with the development tool, or by a remote instance of the engine which has been installed in an environment configured to support such testing. In either case, the machine hosting the engine requires that any client access technologies relied on by the data access definitions [309] for each function in the drawing already be configured on the same machine; for example, in order to write to a table, the correct database drivers must be configured on the machine whose engine will be using those drivers to perform that operation. At any time during this development process, the developer may “check in” the drawing to the repository. This process is conventional in terms of workflow and implementation; the user may provide a comment for the change and a new version of the drawing will be added to the source control system in the repository.

III. Hybrid Version Control System

The project artifacts, which are maintained via traditional source control mechanisms as described, the project staging controls (described below), and the project model which models those sources in a relational database, are maintained using a hybrid version control system, comprising both a standard version control system and a relational database. Traditionally, version control systems have made it possible to record individual changes made to versioned artifacts, but do not allow for the analysis of these changes using standard relational database query techniques. Using pure relational database systems, however, it is extremely difficult to provide traditional version control functionality. Additionally, a traditional source control system does not inherently control access to system sources based on the development life-cycle stage of the project; such systems must rely on externally defined and enforced business practices to control access. The hybrid version control system disclosed herein allows for both traditional artifact versioning/source control and relational data modeling of the same artifacts. The hybrid version control system also provides built-in support for controlling access to project sources according to the current stage of the project.

FIG. 4 is a UML state diagram that depicts the relationships among the various project stages [301], according to a preferred embodiment. The staging model provides control for moving a project through a development life-cycle.

At any given time, a project may be in at most one of the following deployment stages (analogous to states in a state transition diagram): development [301.1], system testing [301.2], integration testing [301.3], user acceptance testing [301.4], readiness testing [301.5], or production [301.6].

Each one of these stages has two superstates. The first superstate signifies whether a project is unlocked [403], which means that changes to project artifacts are allowed, or locked [404], which means that changes are not allowed. The second superstate signifies whether a project is unpublished [401], which means that the project model has not been refreshed from the most recent changes to project artifacts, or published [402], which means that the project model is fully representative of the current project artifacts.

In the preferred embodiment, a project is created, published, and staged using the project maintenance tool [205]. Individual artifacts and changes to them are stored as separate versions in the repository's source-control system using the system's tools such as the project maintenance tool [205] and the development tool [206]. User permissions related to project development may be implemented using any user authentication/control databases, such as LDAP and Active Directory.

After a project is created, it is unpublished [401] and in the development stage [301.1]. Artifacts may only be added, modified, or removed from source control when the project is in the development stage, which also implies that the project is unlocked [403]. When a project is “published,” all of the information stored about the project in the version control system, including, e.g., new versions of drawings and functions, checkin/checkout log entries and times, etc., is moved into a relational database, the contents of which can be queried using traditional relational techniques. After a project is published it will be in a published state such that the repository's relational model of the project has been updated from all current project artifacts in source-control, making the project available for post-development deployment staging.

The project artifacts are moved from the source control system to the relational database using conventional serialization methods and systems. When a project is published to the database, the new publication does not replace the older published version of the project, but is stored as a separate publication. Thus, queries executed against the database may gather information and statistics about multiple publications.
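The idea that each publication is stored alongside earlier ones, rather than overwriting them, can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `publication` table, its columns, and the `publish` function are hypothetical; the actual relational project model is not specified at this level of detail.

```python
import sqlite3

# In-memory database standing in for the repository's relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publication ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " project TEXT NOT NULL,"
    " artifact_count INTEGER NOT NULL)"
)


def publish(conn, project, artifacts):
    """Insert a new publication row; earlier publications are left in place."""
    cur = conn.execute(
        "INSERT INTO publication (project, artifact_count) VALUES (?, ?)",
        (project, len(artifacts)),
    )
    conn.commit()
    return cur.lastrowid


publish(conn, "foo", ["drawing.xml", "record.xml"])
publish(conn, "foo", ["drawing.xml", "record.xml", "access.xml"])

# A query that spans multiple publications of the same project.
rows = conn.execute(
    "SELECT id, artifact_count FROM publication WHERE project = ? ORDER BY id",
    ("foo",),
).fetchall()
```

Because each call inserts a row instead of updating one, a single query can compare the state of the project across its publications.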

If changes are again made to project artifacts while in the development stage, the project will again be in an unpublished state until being explicitly published again. From a published superstate a project in the development stage may be staged forward to any post-development stage including production [301.6]. After being staged out of development, the project is in a locked superstate such that artifacts cannot be modified until the project is staged back to development.
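The staging rules described in this section can be summarized as a small state machine. The following is a minimal sketch under stated assumptions: the `Project` class and its method names are hypothetical, and the choice of exception types is illustrative. It encodes only what the text states: artifacts are editable only in development, any edit makes the project unpublished, and staging out of development requires a published project.

```python
from dataclasses import dataclass

STAGES = (
    "development",
    "system testing",
    "integration testing",
    "user acceptance testing",
    "readiness testing",
    "production",
)


@dataclass
class Project:
    name: str
    stage: str = "development"   # a new project starts in development
    published: bool = False      # and is unpublished

    @property
    def locked(self) -> bool:
        """A project is locked whenever it is outside the development stage."""
        return self.stage != "development"

    def modify_artifact(self) -> None:
        """Record an artifact change; allowed only while unlocked."""
        if self.locked:
            raise PermissionError("artifacts cannot be modified outside development")
        self.published = False  # the project model is now out of date

    def publish(self) -> None:
        """Mark the project model as refreshed from the current artifacts."""
        self.published = True

    def stage_to(self, target: str) -> None:
        """Move to another stage; leaving development requires publication."""
        if target not in STAGES:
            raise ValueError(f"unknown stage: {target}")
        if target != "development" and not self.published:
            raise RuntimeError("project must be published before staging forward")
        self.stage = target
```

For example, a project that is published and staged to integration testing rejects artifact changes until it is staged back to development, at which point a change returns it to the unpublished superstate.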

As an example, after development [301.1] is complete, the project for the sample application [101] may be published and moved to a system testing stage [301.2]. While in this stage, various system tests are performed on the application and changes to the project's artifacts are prohibited. If system testing is successful, the project may be moved to an integration testing stage [301.3]. While in this stage, one of the tests uncovers an issue that must be addressed by a slight change to the configuration of the Write-Data [104] function in the drawing for the application. The project is moved back to the development stage [301.1] so that a developer can make this change. After the change is tested by the developer and checked in, the project is published again and moved back to the integration testing stage [301.3] for re-test. The application might then pass testing at this stage and each subsequent stage until it is finally put into production [301.6].

Each time the artifacts are published, the project model [307] and project measures [308] are updated. Both the project model and measures are maintained as a relational model in the repository. This enables project managers, data architects, decision makers, and other system users to query and analyze one or more projects. For example, a project manager may quickly learn in which projects a developer has used a particular semantic record (which may be known to be broken); the cumulative usage of a certain table across projects; or which output rules for a certain semantic identity are used most. This type of inquiry and analysis is possible because of the publish functionality in the repository.
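As a hedged illustration of such a query, the following sketch uses an in-memory SQLite database. The table names publication and record_usage, their columns, and the sample rows are all invented for this example; the actual relational layout of the project model [307] is not specified in this description.

```python
import sqlite3

# Hypothetical, simplified schema; not the actual project model [307].
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE publication (
        id INTEGER PRIMARY KEY,
        project TEXT NOT NULL,
        published_at TEXT NOT NULL
    );
    CREATE TABLE record_usage (
        publication_id INTEGER NOT NULL REFERENCES publication(id),
        developer TEXT NOT NULL,
        semantic_record TEXT NOT NULL
    );
    """
)
# Made-up sample rows: project "foo" has two separate publications.
conn.executemany(
    "INSERT INTO publication (id, project, published_at) VALUES (?, ?, ?)",
    [(1, "foo", "day-1"), (2, "foo", "day-2"), (3, "billing", "day-2")],
)
conn.executemany(
    "INSERT INTO record_usage VALUES (?, ?, ?)",
    [(1, "alice", "SR1"), (2, "alice", "SR1"), (2, "bob", "SR2"), (3, "alice", "SR1")],
)


def projects_using(db, developer, semantic_record):
    """Return the projects in which a developer used a given semantic record."""
    rows = db.execute(
        """
        SELECT DISTINCT p.project
        FROM publication AS p
        JOIN record_usage AS u ON u.publication_id = p.id
        WHERE u.developer = ? AND u.semantic_record = ?
        ORDER BY p.project
        """,
        (developer, semantic_record),
    ).fetchall()
    return [row[0] for row in rows]
```

Because each publication is stored separately rather than replacing its predecessor, the same join could be extended to compare usage between publications of one project.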

Some project metrics may use information from the source control system as well as the repository. Because a source file may be checked-out and checked-in multiple times between publications, only the source control system contains information about these intermediate file-versions.

FIG. 4.1 is a flowchart that depicts the separation of roles across the various stages of project development, according to certain embodiments of the invention. The project manager [4101] creates a project called “foo” [4102] in the source control system, and assigns users [4103] to it. The data architect [4104] then checks out the project [4105] and creates or modifies the semantic records and data access definitions that will be used by the project “foo” [4112] (these are discussed in more detail below). When this step is complete, the developer [4113] checks out the project [4106] and creates and modifies the project's drawings [4107] in the source repository, which specify the data transformation, extraction, and load rules used by the project and determine how data flows among these rules. When complete, the developer checks the project in [4108]. At this point, the project manager [4101] publishes the project [4109], which moves the project artifacts into the relational database [4110]. After the project has been published, it may be moved into the “staging” phase [4111]. Eventually, the project state will be set to “production,” the final phase of the project development process.

IV. Semantic Model

FIG. 5 is a relationship diagram (as described above) that depicts the components of the semantic model [202] in the repository [201], according to a preferred embodiment. The semantic identity [501] is metadata that represents the abstract concept or meaning for a single business object that may be used in an enterprise; for example, an employee's last name. Additional properties of the semantic identity pertaining to its semantic type, subject area, and composition are also captured in the semantic model.

The output rule [701] defines the business logic for calculating a value for the semantic identity within a data integration application. A semantic identity may have multiple output rules. The output rule and its usage are described in more detail in a later section.

The physical identity [502] is metadata that captures the external (physical) name of a specific business object (e.g., a database column). The physical datatype [504] captures the external (physical) datatype of the associated physical identity (e.g., “20 character ASCII string”). The semantic datatype [505] is associated with the semantic identity and specifies the datatype of the data referenced by the semantic identity, as used internally by the data integration application. The physical datatype is used by the engine when it is reading or writing actual physical data. The semantic datatype is used by the engine when processing transformation logic in the application (described later).

The semantic binding [503] associates a physical identity with a particular semantic identity. Many physical identities and their physical attributes may be associated with the same semantic identity. For example, fields from various physical data locations such as lastName with a physical datatype of CHAR(30) in one RDBMS table, last_name with a physical datatype of VARCHAR(32) in another RDBMS table, and LST_NM with a physical datatype of PICX(20) in a COBOL copybook, may all be physical instantiations of the semantic identity last_name, which could be universally associated with a semantic datatype of string.

A semantic record [305] describes the layout of a physical data structure such as an employee table. Each field in the table would be described with a semantic binding that captures the actual column name (the physical identity) and the semantic identity. Other metadata specific to each field in the employee table, such as data type information, would also be described for each field in the semantic record.
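The relationships among these components might be sketched, purely for illustration, with the following data structures. The class names mirror the terms used above, but the field names, the location labels, and the helper physical_names are assumptions made for this example and are not the repository's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticIdentity:
    name: str               # e.g. "last_name"
    semantic_datatype: str  # datatype used internally by the application


@dataclass(frozen=True)
class PhysicalIdentity:
    name: str               # external column or field name
    physical_datatype: str  # external datatype
    location: str           # hypothetical label for where the field lives


@dataclass(frozen=True)
class SemanticBinding:
    physical: PhysicalIdentity
    semantic: SemanticIdentity


LAST_NAME = SemanticIdentity("last_name", "string")

# Three physical instantiations of the same semantic identity.
BINDINGS = [
    SemanticBinding(PhysicalIdentity("lastName", "CHAR(30)", "rdbms_table_a"), LAST_NAME),
    SemanticBinding(PhysicalIdentity("last_name", "VARCHAR(32)", "rdbms_table_b"), LAST_NAME),
    SemanticBinding(PhysicalIdentity("LST_NM", "PICX(20)", "cobol_copybook"), LAST_NAME),
]


def physical_names(bindings, semantic_name):
    """List every physical identity bound to the given semantic identity."""
    return [b.physical.name for b in bindings if b.semantic.name == semantic_name]
```

Under this sketch, a semantic record would simply be an ordered collection of such bindings describing one physical data structure.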

Using the semantic maintenance and project maintenance tools, a user would create and maintain the semantic model as follows. The user would first locate the actual metadata for the physical data that must be represented. As an example, using the sample application, this would be the metadata for the VSAM file being read and the metadata for the RDBMS table being written. The names and types of each field or column would be preserved as physical identities and physical datatypes. A rationalization process, using conventional string matching techniques and statistical methods, is then performed by the tool that takes each physical identity, decomposes it, analyzes it, and suggests zero or more semantic identities. The user makes the final decision as to which semantic identity most applies to each physical identity. When an existing semantic identity does not apply, the user may define a new one and its semantic datatype. The physical identity, semantic identity, semantic binding, and other metadata gathered during the rationalization process, are saved in the repository.
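One possible form of the suggestion step is sketched below. It is an assumption-laden illustration, not the tool's actual algorithm: the function names decompose and suggest and the 0.6 cutoff are invented here, and Python's difflib.SequenceMatcher stands in for whatever string matching and statistical techniques an implementation might use.

```python
import re
from difflib import SequenceMatcher


def decompose(physical_identity):
    """Split a physical name on case changes, hyphens, underscores, and spaces."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", physical_identity)
    tokens = re.split(r"[\s_\-]+", spaced)
    return "_".join(token.lower() for token in tokens if token)


def suggest(physical_identity, known_semantic_identities, cutoff=0.6):
    """Return zero or more candidate semantic identities, best match first."""
    candidate = decompose(physical_identity)
    scored = [
        (SequenceMatcher(None, candidate, semantic).ratio(), semantic)
        for semantic in known_semantic_identities
    ]
    return [name for score, name in sorted(scored, reverse=True) if score >= cutoff]
```

The result is only a ranked list of suggestions; as described above, the user makes the final decision and may define a new semantic identity when none of the candidates applies.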

The components of the semantic model are described in more detail below.

FIG. 6 is a relationship diagram (as described above) that depicts the structure of a semantic data integration function within an application (such as the sample application described above) according to a preferred embodiment. A function [601] performs an individual body of work within an application. The function in FIG. 6 is a generic representation of any particular function in the present data integration system and could represent any of the functions [102], [103], or [104] in the sample application [101].

Depending on the type of function, the function may have the following types of input: input data [603], which is an actual input data value that the function will consume when it runs in the engine; an input semantic identity [501.1], which is a semantic identity [501] from the semantic model (FIG. 5) that identifies an individual piece of data in a record that will be input to the function; and an input semantic record [305.1], which is a semantic record [305] from the semantic model (FIG. 5) that describes the exact structure and format of a data record that will be input to the function.

Depending on the type of function, the function may have the following types of output: output data [604], which is an actual output data value that the function will produce when it runs in the engine; an output semantic identity [501.2], which is a semantic identity [501] from the semantic model (FIG. 5) that identifies an individual piece of data in a record that will be output from the function; and an output semantic record [305.2], which is a semantic record [305] from the semantic model (FIG. 5) that describes the exact structure and format of a data record that will be output from the function.

A data access definition [309] will also be associated with the function. When the purpose of the function is to read or write data from or to a physical data source, the data access definition will specify one or more URIs for accessing the physical data being read or written, each of which constitutes a parallel processing path (or channel) for the operation. When the function is an internal operation whose job is to manipulate data that has already been read (prior to writing), the data access definition identifies the particular channels that are relevant to the functions it is connected to.

Depending on the type of function, the function may also have transformation logic [602] which may be used to calculate the output values for the function.

A semantic function is able to correlate input data to output data using the semantic identities. For example, if the input semantic record [305.1] includes a field with semantic identity last_name [501.1] whose actual source is a column named lastName [502], and if the output semantic record [305.2] includes a field with semantic identity last_name [501.2] whose actual data source is a field named lstNm in a file [502], then, provided that the semantic model captures these relationships, the function will know that the two fields are semantically equivalent because they share the same semantic identity last_name, and thus can move the correct input data [603] to the correct output data [604] with little or no additional specification.

Using our sample application as an example, the output semantic record [305.2] for the Read-Data function [102] may include a semantic binding [503] that binds the output semantic identity [501.2] last_name to a physical field named LST_NM in the data being read from the VSAM file [105]. The input semantic record [305.1] for the Transform-Data function [103] may include the same semantic binding. The VSAM file on the mainframe stores all last names in upper case, e.g., SMITH. The transformation logic [602] in the Transform-Data function [103], which is a semantic function [601] like all functions in an application for the present invention, may be written to convert the input data [603] for the input semantic identity [501.1] named last_name to title case, e.g., Smith.

In writing this transformation logic, the developer only needs to know the semantic name last_name, and does not require any knowledge about the associated physical identity or the attributes of the VSAM source where the data is physically located. For example, suppose that in a different application in a different project, the physical identity for last name data pulled from a mainframe was called NAME_LAST. Suppose further that, as part of that effort, the semantic model was updated and an additional semantic binding that associates NAME_LAST with the last_name semantic identity was created. The same transformation logic responsible for converting last_name to title case could be used because the transformation uses the semantic identity last_name that is common to both physical identities, LST_NM and NAME_LAST.
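The reuse described above can be illustrated with a brief sketch in which rules are indexed only by semantic identity. The dictionaries BINDINGS and RULES and the helper apply_rule are hypothetical names introduced for this example, and Python's built-in str.title is assumed to be an adequate stand-in for the title-case conversion.

```python
# Hypothetical bindings drawn from two projects; both physical names
# resolve to the same semantic identity.
BINDINGS = {"LST_NM": "last_name", "NAME_LAST": "last_name"}

# Rules are keyed by semantic identity only; no physical name appears here.
RULES = {"last_name": str.title}


def apply_rule(physical_name, value):
    """Resolve the semantic identity, then apply that identity's rule."""
    semantic_name = BINDINGS[physical_name]
    return semantic_name, RULES[semantic_name](value)
```

Adding the NAME_LAST binding required no change to the rule itself, which is the point of the example.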

As a more complete example, suppose the VSAM file read by the example application has the following physical description:

TABLE 1 VSAM Metadata

Physical Identity    Physical Datatype
ACC-NO               PICX(20)
TRANS-TYPE           PICX(1)
TRANS-AMT            9(12)V9(2)
LAST-NAME            PICX(20)
FIRST-NAME           PICX(20)
. . .                . . .

Suppose further that the RDBMS table written by the example application has the following physical description:

TABLE 2 RDBMS Metadata

Physical Identity    Physical Datatype
accId                VARCHAR(32)
accBal               NUMERIC(10, 2)
lastName             VARCHAR(32)
firstName            VARCHAR(32)
. . .                . . .

Outside of the context of the application project, a user would use the semantic maintenance tool to import the physical identities specified in Tables 1 and 2, in order to rationalize these physical identities to semantic identities, as described above (if the repository already contains semantic records corresponding to these two data tables, then it would not be necessary to import these physical identities again; for present purposes we assume that they are being imported for the first time). At this point, for each of these physical identities, the semantic maintenance tool will suggest corresponding semantic identities. The user can affirm or override these suggestions.

When this process is completed, the result is a mapping of (physical identity, semantic identity) pairs. Suppose, for the purposes of the present example, that this mapping is specified as follows:

TABLE 3 Mapping from Physical to Semantic Identities

Physical Identity    Semantic Identity
ACC-NO               account_number
accId                account_number
accBal               account_balance
TRANS-TYPE           transaction_type
TRANS-AMT            transaction_amount
LAST-NAME            last_name
lastName             last_name
FIRST-NAME           first_name
firstName            first_name
. . .                . . .

At this point, the user may associate the title-case rule (as described above) with the semantic identity last_name. This rule, along with any other rules created by the user and associated with semantic identities, is stored in the repository.

The user may now create semantic records corresponding to both the VSAM file and the RDBMS data sources, within the context of a specific project. These semantic records combine the physical metadata contained in Tables 1 and 2 with the semantic bindings in Table 3. For example, the semantic record SR1, corresponding to the VSAM file, would contain the following:

TABLE 4 Semantic Record for VSAM file (SR1)

Phys. Ident.    Phys. Datatype    Semantic Ident.       Semantic Datatype
ACC-NO          PICX(20)          account_number        string
TRANS-TYPE      PICX(1)           transaction_type      string
TRANS-AMT       9(12)V9(2)        transaction_amount    number
LAST-NAME       PICX(20)          last_name             string
FIRST-NAME      PICX(20)          first_name            string
. . .           . . .             . . .                 . . .

And the semantic record SR2, corresponding to the RDBMS table, would contain the following:

TABLE 5 Semantic Record for RDBMS table (SR2)

Phys. Ident.    Phys. Datatype    Semantic Ident.       Semantic Datatype
accId           VARCHAR(32)       account_number        string
accBal          NUMERIC(10, 2)    account_balance       number
lastName        VARCHAR(32)       last_name             string
firstName       VARCHAR(32)       first_name            string
. . .           . . .             . . .                 . . .

These semantic records are saved in the repository as part of the project corresponding to the sample application. In the same project, a user would use the development tool to create a visual drawing for the application that references these semantic records. To configure the Read-Data function, the user would specify metadata that identifies the location of the VSAM file from which the data must be read, and associate the previously-defined semantic record SR1 with the function as the function's output semantic record.

To configure the Transform-Data function, the developer would first connect the output of the Read-Data function to the input of the Transform-Data function, preferably via the graphical development tool, which represents functions and the connections between them using a graphical representation. Next, the developer would configure the output of the Transform-Data function to include the semantic identities listed in the semantic record SR2. When specifying semantic identities in the output of Transform-Data, the user will be presented with a menu of rules stored in the repository that operate on those identities (allowing the user to select only valid, predefined rules). In this case, suppose that when the user specifies last_name, the user selects the title-case rule (as defined above) from the rules menu.

Finally, the developer would connect the output of Transform-Data to the input of the Write-Data function, and specify the location of the RDBMS table to which the data must be written. As detailed above, the task of connecting two functions can be performed visually, using the graphical development tool. Throughout the process of configuring the application rules, the development tool never reveals the physical identities or datatypes of the source and target data to the user; this information is encapsulated in the semantic records SR1 and SR2, which are opaque to the application developer.
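The configuration just described can be sketched end to end as follows. This is a simplified illustration only: semantic records are reduced to dictionaries from physical identity to semantic identity, datatypes and data access definitions are omitted, and the function names read_data, transform_data, write_data, and run are hypothetical. No rule for account_balance is given in this example, so that output field is simply left unset.

```python
# SR1: physical -> semantic identities for the VSAM file (Table 4).
SR1 = {
    "ACC-NO": "account_number",
    "TRANS-TYPE": "transaction_type",
    "TRANS-AMT": "transaction_amount",
    "LAST-NAME": "last_name",
    "FIRST-NAME": "first_name",
}
# SR2: physical -> semantic identities for the RDBMS table (Table 5).
SR2 = {
    "accId": "account_number",
    "accBal": "account_balance",
    "lastName": "last_name",
    "firstName": "first_name",
}
# The title-case rule selected for last_name; str.title is an assumed stand-in.
RULES = {"last_name": str.title}


def read_data(physical_row, output_record):
    """Relabel physical fields with their semantic identities."""
    return {output_record[name]: value for name, value in physical_row.items()}


def transform_data(row, output_identities, rules):
    """Apply the selected rule per output identity; otherwise pass the value through."""
    out = {}
    for identity in output_identities:
        if identity in row:
            rule = rules.get(identity)
            out[identity] = rule(row[identity]) if rule else row[identity]
    return out


def write_data(row, input_record):
    """Relabel semantic identities with the target's physical names."""
    return {name: row[ident] for name, ident in input_record.items() if ident in row}


def run(physical_row):
    semantic_row = read_data(physical_row, SR1)
    transformed = transform_data(semantic_row, set(SR2.values()), RULES)
    return write_data(transformed, SR2)
```

Note that only read_data and write_data consult physical names; transform_data, the part a developer configures, sees semantic identities exclusively.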

V. Output-Oriented Rules

FIG. 7 is a relationship diagram (as described above) that depicts an output-oriented rule definition, according to a preferred embodiment. The rule [701] contains the logic and instructions to perform a calculation. Output data [604] is the actual data value that the rule will calculate and produce when it runs. An output semantic identity [501.2] is a semantic identity [501] from the semantic model (FIG. 5) that identifies the output data.

Depending on the type of rule, the rule may have input which is characterized as follows: input data [603], which comprises one or more actual input data values that the rule will consume when it runs; and an input semantic identity [501.1], which is a semantic identity [501] from the semantic model (FIG. 5) that identifies an individual piece of data that will be input to the rule (an input parameter).

The rule [701] is defined to calculate a value for an output data field [604] with a given semantic identity [501.2]. All input data [603] required by the rule is identified using semantic identities [501].

There may be an arbitrary number of rules associated with a given semantic identity. Using the semantic maintenance tool [204] (FIG. 2), these rules can be developed independently from the application or function, tested (described in more detail below), stored and indexed by semantic identity in the repository [201] (FIG. 2), and then used in the transformation logic of a function.

Traditional data integration processes and systems lack the ability to semantically reconcile fields in different systems that are being integrated. In such processes hundreds, if not thousands, of business rules are documented for the purpose of mapping fields in source systems to the appropriate fields in target systems. In the present data integration system, because application functions can automatically correlate input and output data semantically, the system does not require a process to capture or implement data mapping rules as in traditional systems. These differences are explained in more detail in the examples below.

However, rules that perform some operation other than a direct move between input and output are still needed. The semantic data integration system optimizes the definition and employment of rules by semantically orienting them explicitly to output calculation as described.

Recalling the example presented in discussion of FIG. 6, the transformation logic associated with converting a last name to title case could be captured as a reusable output-oriented rule for the last_name semantic identity. This rule could be used in the sample application, as well as other applications in the same or different projects.

FIG. 8 is a relationship diagram (as described above) that extends FIG. 6 and adds the concept in FIG. 7 to depict the preferred embodiment of the employment of output-oriented rules in a function [601].

Transformation logic is configured for the function to perform various transformations or manipulations on the input data [603] in order to produce the correct output data [604]. Rules refer to input data and output data using semantic identities [501.1] and [501.2] respectively.

As described above, an example of a rule used in a function may be something trivial such as changing the case of last name. It may also be used for something more complex such as calculating a weighted account average balance. Preferably, rules are specified using a standard programming language that has been extended to include primitives that operate on typical database fields.

Using our sample application [101] as an example, when the Transform-Data [103] function is being configured by a developer, the calculations for its individual output fields are defined. The predefined output-oriented rule for title casing last name may be referenced and used to define the calculation for that field in the function. An example of an alternative embodiment of this process allows for a new output-oriented rule to be defined at the same time that the transformation logic for Transform-Data is being configured. In this case the title casing rule might not already exist and the developer might add it and save it to the repository for general use.

As further examples of output-oriented rules, consider the following:

TABLE 6 Example Output-Oriented Rules

Target                         Rule
master_account_type_code       master_account_type_code
account_type_code              account_type_code
account_start_date             datetime.moment(account_open_date, “C”)
account_expiration_date        account_expiration_date
account_ever_activated_code    if is.empty(account_date_first_active)
                                   then “N”
                                   else “Y”
next_account_number            account_number + 1

These rules are written in an untyped programming language, and type-conversions are performed by the system as necessary. Because the system performs type-conversions automatically, the application developer does not need to know the semantic datatypes of the semantic identities used in a rule. For example, the semantic identities account_number and next_account_number might have semantic datatypes of string, and would therefore be represented internally as sequences of characters. However, a developer might treat account_number as an integer, as illustrated in Table 6, where next_account_number is defined as account_number + 1. In this case, the system will recognize that “+” is an operator that applies to integers, convert account_number to an integer and perform the requested calculation. Finally, it will convert the result to a string, since the semantic datatype of next_account_number is string.
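The conversion sequence described above might be sketched as follows. The names SEMANTIC_TYPES, as_integer, to_semantic_type, and next_account_number are hypothetical, and the sketch hard-codes a single rule rather than interpreting a rule language; how the system handles values that cannot be converted is not described here and is not modeled.

```python
# Assumed semantic datatypes for the two identities used in the rule.
SEMANTIC_TYPES = {"account_number": "string", "next_account_number": "string"}


def as_integer(value):
    """Coerce an operand of "+" to an integer, as the system is said to do."""
    return int(value) if isinstance(value, str) else value


def to_semantic_type(value, semantic_type):
    """Convert a computed value to the output identity's semantic datatype."""
    return str(value) if semantic_type == "string" else value


def next_account_number(row):
    # Rule from Table 6: next_account_number = account_number + 1
    result = as_integer(row["account_number"]) + 1
    return to_semantic_type(result, SEMANTIC_TYPES["next_account_number"])
```

In this simple sketch a value with leading zeros would lose them on the round trip through an integer; whether the described system preserves such formatting is not stated.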

It is not necessary to include rules in the rules repository that merely pass the value of a semantic identity from input to output without applying a transformation (e.g., the rule for master_account_type_code in Table 6, above). This “pass-through” operation is the default behavior for semantic identities and will be applied if no rule is specified. Thus, although there is no rule for account_number specified above, any function that receives account_number as part of its input semantic record will pass the received value through to its output semantic record.
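A minimal sketch of this default follows, assuming rules are stored in a dictionary keyed by output semantic identity and each rule receives the whole input row. The names is_empty, RULES, and compute_outputs are invented for the example; is_empty is only an assumed reading of the is.empty primitive shown in Table 6.

```python
def is_empty(value):
    """Assumed behavior of is.empty: true for a missing or blank value."""
    return value is None or value == ""


# Only the non-trivial rule is registered; pass-through rules are omitted.
RULES = {
    "account_ever_activated_code":
        lambda row: "N" if is_empty(row.get("account_date_first_active")) else "Y",
}


def compute_outputs(input_row, output_identities, rules):
    """Apply a rule when one is registered; otherwise pass the value through."""
    out = {}
    for identity in output_identities:
        rule = rules.get(identity)
        out[identity] = rule(input_row) if rule else input_row[identity]
    return out
```

Here account_number and account_type_code reach the output unchanged without any entry in RULES.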

By contrast, traditional integration systems require the source and target locations, table names, and datatypes used in a rule to be stored with the rule logic. Using the traditional approach, the rules described in Table 6 might be represented as follows:

TABLE 7 Traditional Representation of Transformation Rules

Target Table  Target Column  Target Type  Rule                        Source Table  Source Column  Source Type
ACC_INFO      MST_ACC_TYPE   INTEGER      MST_ACC_TYPE                ACCT_INF      MST_ACC_TYPE   INTEGER
ACC_INFO      ACC_TYPE_CD    CHAR(1)      ACC _CD                     ACCT_INF      ACC _CD        CHAR(1)
ACC_INFO      ACC_ST_DATE    DATE         time(ACC_OP, “C”)           ACCT_INF      ACC_OP         DATE
ACC_INFO      ACC_EXP_DATE   DATE         ACC_EXP_DT                  ACCT_INF      ACC_EXP_DT     DATE
ACC_INFO      ACCT_EVER_ACT  CHAR(1)      if is.empty(ACC_ACT)        ACCT_INF      ACC_ACT        CHAR(1)
                                              then “N”
                                              else “Y”
ACC_INFO      NXT_ACC_NO     CHAR(20)     tochar(toint(ACC_NO) + 1)   ACCT_INF      ACC_NO         CHAR(20)
ACC_INFO      ACC_NO         CHAR(20)     ACCT_NUM                    ACCT_INF      ACCT_NUM       CHAR(20)
ACC_INFO      ACC_NO         CHAR(20)     tochar(ACCT_NO)             AC_DATA       ACCT_NO        INTEGER

Specifying rules in the traditional way (as illustrated in Table 7) requires that the developer know not only the physical locations and names of the business objects being referenced, but their internal data format as well. In such a system, the developer would be forced to perform type conversions explicitly: for example, to add “1” to ACC_NO, the rule “tochar(toint(ACC_NO)+1)” might be used (as opposed to the untyped rule definition account_number + 1, as used above).

Also, it is necessary to specify pass-through rules using the traditional approach: for example, two rules are defined for ACC_NO in Table 7, both of which read source data from different physical sources whose data are stored using different formats. As explained above, output-oriented rules do not require pass-through rules to be specified, because the physical data sources and datatypes are included as part of a semantic identity.

VI. Synthetic Debugging

FIG. 9 is a relationship diagram (as described above) that extends FIG. 6 to depict the preferred embodiment of function-level synthetic debugging and testing for semantic data integration. A function [601] and its transformation logic [602] (as described above) may be tested using test data [901]. Test data for each input semantic identity [501.1] may come from a variety of sources including: derived test data [902], which is automatically derived from the input semantic record by a generator function [905]; specified test data [903], which is manually specified by the user [906]; and existing test data [904], which is retrieved from the repository [201].

The system preserves data security by not exposing actual business data values within the development tool or while a data integration application is being executed by the engine. In order to debug and test applications, synthetic debugging and testing is employed at the function level. The ability to provide synthetic test data also allows for offline development in situations when the actual data sources might not be available.

After initiating a debugging exercise from within the development tool, the user will assign test data [901] for each input semantic identity [501.1]. Test data values can come from multiple sources. A test data generator function [905] can use information derived from the input semantic record [305.1] to synthetically generate test values [902], the user [906] may manually specify the test data values [903], or existing test data values [904] that are cataloged in the repository [201] by semantic identity may be used. The user may choose to store test data values for each semantic identity back to the repository for future debugging and testing. Once test data has been assigned, the user can test the function with these test values. In this test one iteration of the function will run using the input test data to produce the function's output data [604] which can then be displayed by the development tool and validated by the user.

Using the sample application [101] as an example, a developer may want to synthetically debug the Transform-Data function [103], in particular the logic described above that changes the case of last_name. The developer may first try to re-use existing test data [904] from the repository. If no test data for last_name is found, the developer may try to generate test data [902]. Using the metadata from the input semantic record, the test data generator [905] may generate test data [902] that looks like this: ‘AaBbCcDdEeFfGgHhIiJj’. Upon testing this data with the function, the output correctly produces ‘Aabbccddeeffgghhiijj’. In order to further validate the function, the developer specifies his own test data [903]: ‘sT. jOHn’. Upon testing this data with the function, the output correctly produces ‘St. John’.

The developer saves this new test data to the repository so that it may be re-used the next time a developer needs test data for last_name. This is accomplished by, e.g., associating the new test data value with the associated semantic identity in a relational database table.

The ability to enter custom test data values allows the developer to ensure that a function responds appropriately to certain problematic input values that might not be generated by the random test data generator (e.g. integers that include random letters, negative account numbers, etc.). These custom values are associated with a semantic identity (e.g., last_name), so once entered, they can automatically be reused as test data for any function, in any project, that uses the same semantic identity.
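The three sources of test data can be illustrated with the following sketch. The class SyntheticDataStore, the functions generate and values_for, and the seeded random generator are all assumptions made for this example; an in-memory dictionary stands in for the relational table that would associate test values with semantic identities.

```python
import random
import string


class SyntheticDataStore:
    """Hypothetical stand-in for test data cataloged by semantic identity."""

    def __init__(self):
        self._values = {}

    def save(self, semantic_identity, value):
        self._values.setdefault(semantic_identity, []).append(value)

    def lookup(self, semantic_identity):
        return list(self._values.get(semantic_identity, []))


def generate(length, seed=0):
    """Derive a random letter string sized from the field's metadata."""
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_letters) for _ in range(length))


def values_for(store, semantic_identity, length):
    """Prefer existing test data; fall back to one generated value."""
    existing = store.lookup(semantic_identity)
    return existing if existing else [generate(length)]


def title_case(value):
    # The last_name rule under test; str.title is an assumed stand-in.
    return value.title()
```

Once a developer saves a hand-written value such as ‘sT. jOHn’ under last_name, any later call to values_for for that semantic identity returns it, regardless of which function or project requests it.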

VII. Enterprise Maintenance

FIG. 10 combines a relationship diagram (as described above) and UML use-case diagram to depict the high-level separation of semantic data integration user activities, according to a preferred embodiment. Activities performed by system users can be classified either as enterprise maintenance or as application development.

Enterprise maintenance [1001] is performed with the semantic maintenance tool [204] and the project maintenance tool [205], and has two basic subcategories. Semantic maintenance deals with the maintenance of the semantic model [202] including semantic identities [501] and physical identities [502] (FIG. 5), and output-oriented semantic rules [701] (FIG. 7). Project maintenance is concerned with the maintenance of the project state and architecture-level objects such as semantic records [305] (FIG. 5) and data access definition [309] (FIG. 3) which may be defined at the project level or across the enterprise when reusability is possible.

Application development [1002] is performed with the development tool [206] (FIG. 2) and is concerned with the development of data integration drawings [303] within or across projects. Application development involves many of the objects that fall within the purview of enterprise maintenance, such as semantic identities and output-oriented rules. However, physical identities are never referenced in the context of application development.

As a result of this enforced separation between application development and physical identities, the application developer does not require knowledge of physical identities [502] or physical data locations [1003]. Applications may be developed independent of the physical data sources that they will integrate, providing a level of insulation from physical data sources whose location, connectivity, structure, and metadata may be unstable.

Recall the example first provided in the discussion for FIG. 3. In that example the project manager and data architect were performing enterprise maintenance [1001] activities as described above, including project maintenance. Additionally, a data steward would perform semantic maintenance, such that many of the semantic bindings [503] needed for the semantic records [305] created during project maintenance would already exist. When creating the semantic records that describe the VSAM file [105], the data architect may discover that a semantic binding does not yet exist between the last_name semantic identity and the LST_NM field in the VSAM file. This binding could have been defined by a data steward during regular semantic maintenance activities. But if the binding does not exist, the data architect can also create that binding. Once all of the necessary semantic bindings exist, the data architect can complete the task of creating the semantic record for the VSAM file. Once the semantic record is complete and exists as a project artifact, the developer can be told to use that record.

The developer would then use that semantic record when creating the drawing [303] that describes the actual application [101]. At no point does the developer need to know anything about the physical nature of the VSAM file structure including the physical identities of its data. The developer can work strictly with semantic identities to define the data integration application.

VIII. Data Integration Engine

FIG. 11 is a control flow relationship diagram that illustrates the control flow within the data integration engine when the example application [101] is executed on a single host, according to a preferred embodiment. A control flow relationship diagram is a hybrid UML class/activity diagram that conveys the directionality of communication or contact between objects or components. Solid arrows indicate synchronous communication or contact and broken arrows indicate asynchronous communication or contact.

The data integration engine [207] has a single parent process [1102] which is a top-level operating system process whose task is to execute a data integration application defined by artifacts [302] within a specific project [203.1] (FIG. 3). The data integration engine uses these artifacts (e.g., the application drawing) to set up, initialize, and perform the data integration.

Distributed shared memory [1101] is a structured area of operating system memory that is used by the parent process [1102] and the child processes [1103.1, 1103.2, 1103.3] running on that host. Each of these child processes is responsible for performing a single function within the application. In the sample application [101], child process A [1103.1] executes the Read-Data function [102], child process B [1103.2] executes the Transform-Data function [103], and child process C [1103.3] executes the Write-Data function [104]. Worker threads [1104.1-1104.9] subdivide the processing for each child process.

When the parent process [1102] starts, it analyzes the application drawing [303] (FIG. 2) and related metadata in other project artifacts to initialize and run the application. The parent process creates and initializes a shared properties file (not shown) with control flow characteristics for the child processes and threads. The parent process also creates and initializes the distributed shared memory [1101], a section of which is specifically created for and assigned to each child process and thread. Each child process writes information about its execution status to its assigned portion of the distributed shared memory, and this is used by the parent process [1102] to provide updates about the execution status of the data integration engine.

After initialization, the parent process will create each child process [1103.1, 1103.2, 1103.3], synchronously or asynchronously depending on the nature of the function. One child process is created for each function in the drawing (e.g., Read-Data, Transform-Data, and Write-Data in the example application [101]). When possible, the engine runs the functions in parallel so that one function does not need to complete in order for the next to begin.

Upon creation, each child process will read characteristics relevant to its execution from the shared properties file. These characteristics include information about how many threads should be running simultaneously to maximize parallelism. For example, if the shared properties file indicates that there are three different physical data sources for the data read by child process A [1103.1], then child process A will spawn three worker threads [1104.1, 1104.2, 1104.3], each of which loads its data from a different source.

Continuing this example, because child process A is reading data from three sources using three different threads, it has three outputs. So, child process B, which transforms the data read by child process A, has three sources of input. Child process B accordingly spawns three worker threads [1104.4, 1104.5, 1104.6], each thread reading the data output by one of the worker threads spawned by child process A. Finally, child process C, which writes the output of child process B to the specified target, spawns three threads, each of which corresponds to a thread spawned by child process B. This thread system allows the data integration engine to take advantage of the parallelism made possible by multiple data sources.
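The one-thread-per-source arrangement described above might be sketched, by way of example only, as follows. The helper run_child_process, the in-memory lists standing in for physical data sources, and the upper-casing transformation are all assumptions made for illustration. For simplicity this sketch lets each stage finish before the next begins, whereas the engine described here allows the stages to overlap.

```python
import threading


def run_child_process(sources, work):
    """Spawn one worker thread per source and collect each thread's result."""
    results = [None] * len(sources)

    def worker(index, source):
        # Each thread handles exactly one source and fills its own slot.
        results[index] = work(source)

    threads = [
        threading.Thread(target=worker, args=(i, s))
        for i, s in enumerate(sources)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Three hypothetical in-memory "sources" stand in for physical data sources.
sources = [["a", "b"], ["c"], ["d", "e", "f"]]

# Child process A: three sources, therefore three threads and three outputs.
read_outputs = run_child_process(sources, list)

# Child process B: three inputs, therefore three threads.
transformed = run_child_process(
    read_outputs, lambda rows: [r.upper() for r in rows]
)
```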

When a function involves reading from or writing to a data source, the data integration engine examines the application drawing to determine the type of the data source involved. Based on the type of the data source, the appropriate interface methods are selected, and the data is read or written accordingly.

Control flow is asynchronous and non-locking between parent process, child processes, and threads. This is achieved by combining an update-then-signal asynchronous protocol for all communication (except for communication between threads, described above) and signaler-exclusive distributed shared memory segmentation. Under the update-then-signal protocol, when a parent or child process needs to communicate with its child process or thread, respectively, it may update the distributed shared memory of the child and then asynchronously signal the child. When the child handles the signal, it will read its updated distributed shared memory (if necessary) and react. Communication in the other direction is the same. When a thread or child process needs to communicate with its parent (child process or parent process, respectively), it may first update its distributed shared memory and then asynchronously signal the parent. When the parent handles the signal, it will read the updated distributed shared memory (if necessary) and react. The distributed shared memory areas used for communication are exclusively written by the signaler, ensuring that two processes never attempt to access the same memory simultaneously.
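A minimal sketch of the update-then-signal ordering is given below, under several stated assumptions: Python threads stand in for processes, a plain dictionary stands in for a segment of distributed shared memory, and threading.Event stands in for an asynchronous signal. The Segment class and all names are hypothetical. The point illustrated is only that each segment has exactly one writer and that the write always precedes the signal.

```python
import threading


class Segment:
    """A memory segment written only by its signaler (single writer)."""

    def __init__(self):
        self.data = {}
        self.signal = threading.Event()


def child(parent_to_child, child_to_parent, log):
    # Handle the parent's signal, then read what the parent wrote.
    parent_to_child.signal.wait()
    command = parent_to_child.data["command"]
    log.append(("child read", command))
    # Reply in the other direction: update first, then signal.
    child_to_parent.data["status"] = f"done:{command}"
    child_to_parent.signal.set()


parent_to_child, child_to_parent, log = Segment(), Segment(), []
t = threading.Thread(target=child, args=(parent_to_child, child_to_parent, log))
t.start()

parent_to_child.data["command"] = "start"  # update
parent_to_child.signal.set()               # then signal

child_to_parent.signal.wait()              # parent handles the child's signal
t.join()
status = child_to_parent.data["status"]
```

Because the parent is the only writer of parent_to_child and the child is the only writer of child_to_parent, no lock is taken around either dictionary in this sketch.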

FIG. 12 is a data flow relationship diagram that extends FIG. 11 to depict the flow of data within the data integration engine [207] when the example application is executed on a single host. A data flow relationship diagram is an extension of the control flow relationship diagram (as described above) whose purpose is to convey the directionality and flow of data between objects or components. Relevant, previously described, control flow may be shown as muted or grayed, while the objects pertinent to the data flow within that control flow will be prominent or black. The data flow is captured with an arrow indicating the source of the data (no arrow pointer) and the target of the data (arrow pointer) that is optionally labeled with the resource responsible for the data flow.

The only additional annotations in FIG. 12 are channels [1201.1-1201.6], which are resources that are used for passing data from a worker thread for one function to a worker thread for another. Recall from above that the role of child process A [1103.1] is to read data (see the Read-Data function [102] in FIG. 1), the role of child process B [1103.2] is to produce new data by applying transformation logic to that data (see the Transform-Data function [103] in FIG. 1), and the role of child process C [1103.3] is to write the data produced by child process B (see the Write-Data function [104] in FIG. 1). In this model, data flows through a dedicated channel from a worker thread spawned by one child process to a worker thread spawned by another child process. Channels are implemented directly or indirectly through any means of interprocess communication, e.g., named pipes, sockets, riiop, rpc, soap, oob, and mpi.

As described above, each child process subdivides its work using parallel worker threads. In addition, because the characteristics of each function in this example allow for simultaneous processing, child process B [1103.2] does not wait for child process A [1103.1] to read all of the data before it begins; it can start transforming data received from child process A as soon as child process A outputs any data. Similarly, child process C [1103.3] does not wait for child process B to transform all of the data before it begins; it can start writing data received from child process B as soon as child process B outputs any data.

In the application drawing, each semantic record is associated with a list of Uniform Resource Identifiers (URIs) that point to the relevant data. These URIs might point to redundant copies of identical data or to data sources containing different data, but all of the indicated data sources must conform to the semantic record format that is specified in the file. Generally, each URI in the list will be unique, allowing the engine to leverage parallelism by reading data simultaneously from several different locations. However, this is not a requirement, and if desired, two or more identical URIs can be listed.

Channel data flow in the sample application is structured as follows: each worker thread on child process A will read data in parallel from a data source specified by one of the listed URIs. As each thread [1104.1, 1104.2, 1104.3] spawned by child process A reads data, it makes that data available as output from child process A to be used as input for child process B [1103.2] by moving the data through a dedicated channel. In this example channel A1B1 [1201.1] is a resource that is defined to pass data from thread A1 [1104.1] on child process A to thread B1 [1104.4] on child process B, channel A2B2 [1201.2] passes data from thread A2 [1104.2] to thread B2 [1104.5], and channel A3B3 [1201.3] passes data from thread A3 [1104.3] to thread B3 [1104.6].
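The channel data flow described above might be sketched as follows, strictly as an illustration. Here queue.Queue stands in for a channel (the embodiments contemplate named pipes, sockets, and similar means), in-memory lists stand in for data sources and targets, and the DONE sentinel, the function names, and the upper-casing transformation are hypothetical. Each lane has its own pair of dedicated channels, and a downstream thread begins work as soon as any item arrives.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-data marker placed on a channel


def reader(rows, out_channel):
    # Worker thread of child process A: read rows and pass them on.
    for row in rows:
        out_channel.put(row)
    out_channel.put(DONE)


def transformer(in_channel, out_channel):
    # Worker thread of child process B: starts as soon as data arrives.
    while (row := in_channel.get()) is not DONE:
        out_channel.put(row.upper())
    out_channel.put(DONE)


def writer(in_channel, target):
    # Worker thread of child process C: write rows to the target.
    while (row := in_channel.get()) is not DONE:
        target.append(row)


def run_pipeline(sources):
    targets = [[] for _ in sources]
    threads = []
    for i, rows in enumerate(sources):
        a_to_b, b_to_c = queue.Queue(), queue.Queue()  # dedicated channels
        threads += [
            threading.Thread(target=reader, args=(rows, a_to_b)),
            threading.Thread(target=transformer, args=(a_to_b, b_to_c)),
            threading.Thread(target=writer, args=(b_to_c, targets[i])),
        ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return targets


pipeline_output = run_pipeline([["a", "b"], ["c"], ["d"]])
```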

When each worker thread is spawned, it receives information that can be used to identify an input channel and an output channel, using a predetermined channel identification scheme. The thread connects to both of these channels, reads data from the input channel, and writes output data to the output channel. Thus, each thread connects on startup to the appropriate data channels.
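The predetermined channel identification scheme is not detailed here. One possible scheme, assumed purely for illustration, derives channel names of the form A1B1 from the function letters and the thread index. The names A1B1, A2B2, and A3B3 appear in the example above; names such as B1C1 and the helper functions below are extrapolations.

```python
def channel_name(src_func, dst_func, index):
    """Build a channel name such as 'A1B1' from two function letters."""
    return f"{src_func}{index}{dst_func}{index}"


def channels_for(func, index, order=("A", "B", "C")):
    """Return (input_channel, output_channel) names for one worker thread.

    The first function has no input channel and the last has no output
    channel; None is returned in those positions.
    """
    pos = order.index(func)
    inp = channel_name(order[pos - 1], func, index) if pos > 0 else None
    out = (
        channel_name(func, order[pos + 1], index)
        if pos < len(order) - 1
        else None
    )
    return inp, out
```

Under this assumed scheme, thread B1 would compute its two channel names from nothing more than its function letter and its index, without consulting any other thread.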

FIG. 13 is a modified UML collaboration diagram that illustrates the startup sequence that results when the sample application is executed in a distributed environment comprising three hosts [1301.1, 1301.2, 1301.3], according to a preferred embodiment. UML collaboration diagrams are used to convey the order of messages passed between objects or components. In the collaboration diagrams used here, existing control flow relationships may also be depicted in gray in order to preserve useful context.

The primary difference between this scenario and that depicted in FIG. 11 is that processing will be distributed across three hosts [1301.1, 1301.2, 1301.3] in a networked environment. In particular, the first function is specified to run on host A [1301.1], the second function is specified to run on host B [1301.2], and the third function is specified to run on host C [1301.3].

The application is started by executing the data integration engine [207.1] on host A, the “master” host. During the initial setup of the integration engine, the master host reads the application drawing to determine which application functions will be executed on the master host. Each function is associated with a list of URIs, each of which represents a host on which the function can be executed. For each function, the application developer selects one of the listed hosts, the selection is recorded in the application drawing, and the corresponding host is used by the data processing engine to execute the function. If no host is specified, the function will execute by default on the same host as the previous function, if possible.
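The default host rule described above might be sketched as follows. The function assign_hosts, the host labels, and the choice to fall back to the master host when the first function has no selection are assumptions for illustration; the "if possible" qualification is not modeled.

```python
def assign_hosts(functions, master_host):
    """Map each function to a host.

    `functions` is a list of (name, selected_host_or_None) in drawing order.
    A function with no selection runs on the previous function's host; for
    the first function the master host is assumed.
    """
    assignment = {}
    previous = master_host
    for name, selected in functions:
        host = selected if selected is not None else previous
        assignment[name] = host
        previous = host
    return assignment


example_assignment = assign_hosts(
    [("Read-Data", "hostA"), ("Transform-Data", None), ("Write-Data", "hostC")],
    master_host="hostA",
)
```

In this invented example, Transform-Data has no selected host and therefore inherits hostA from Read-Data.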

Binary data is passed between hosts using any standardized protocol and byte-order. Preferably, network byte-order is used to transfer binary data between hosts and to temporarily store data on execution hosts. When an operation must be performed that operates on data in machine-native format, the data is automatically converted to machine-native byte-order for the operation, and converted back to the standardized byte-order (e.g., network byte-order) afterwards.
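As an illustrative sketch of this convert, operate, and convert back pattern, the following uses Python's standard struct module, whose "!" format prefix selects network (big-endian) byte-order. The 4-byte signed integer layout and the function names are assumptions chosen for the example.

```python
import struct


def to_network(value):
    """Encode a 4-byte signed integer in network (big-endian) byte-order."""
    return struct.pack("!i", value)


def from_network(data):
    """Decode a 4-byte network-order integer into a native Python int."""
    return struct.unpack("!i", data)[0]


def add_in_native(wire_a, wire_b):
    """Convert to native form, perform the operation, convert back."""
    a = from_network(wire_a)
    b = from_network(wire_b)
    return to_network(a + b)


wire_sum = add_in_native(to_network(2), to_network(40))
```

The values stay in network byte-order whenever they are stored or transferred, and take native form only for the duration of the addition.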

In the particular case of the example application [101], the host A parent process [1102.1] determines that only the Read-Data function will run as a child process on host A. The host A parent process creates a full structure for distributed shared memory [1101.1] but only the sections relevant to child processes that need to run on host A will be initialized, in this case the single child process for the first function. A child process [1103.1] for the Read-Data function is then started on host A in the manner described above. Note that the host A parent process [1102.1], distributed shared memory [1101.1], and child process A [1103.1] are analogous to the parent process [1102], distributed shared memory [1101], and child process A [1103.1] described above in FIG. 11 and FIG. 12.

The host A parent process then starts a new engine parent process [1102.2] on host B, passing input indicating that functions already reserved for host A should be ignored. During initial analysis of the input application, the host B parent process ignores the Read-Data function since it is marked for host A and determines that only the Transform-Data function should run as a child process on host B. As explained above, this choice was optionally made by the developer during the development process and is recorded in the application drawing artifact.

The host B parent process creates a full structure for distributed shared memory [1101.2] but only the sections relevant to child processes that need to run on host B will be initialized (in this case, the child process for the Transform-Data function). A child process for the Transform-Data function is then started on host B in the manner described above. Note that the host B parent process [1102.2], distributed shared memory [1101.2], and child process B [1103.2] are analogous to the parent process [1102], distributed shared memory [1101], and child process B [1103.2] described above in FIG. 11 and FIG. 12.

The host B parent process then starts a new engine parent process [1102.3] on host C, passing input indicating that functions already reserved for hosts A and B should be ignored. During initial analysis of the input application, the host C parent process ignores the Read-Data and Transform-Data functions because they have been reserved for the other hosts, and determines that only the Write-Data function should run as a child process on host C. As explained above, this choice was optionally made by the developer during the development process and is recorded in the application drawing artifact.

The host C parent process creates a full structure for distributed shared memory [1101.3] but only the sections relevant to child processes that need to run on host C will be initialized (in this case, the child process for the Write-Data function). A child process for the Write-Data function is then started on host C in the manner described above. Note that the host C parent process [1102.3], distributed shared memory [1101.3], and child process C [1103.3] are analogous to the parent process [1102], distributed shared memory [1101], and child process C [1103.3] described above in FIG. 11 and FIG. 12.

Because there are no more functions to be allocated at this point, the distributed startup sequence is complete.
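The chained startup sequence described above might be modeled, as a sketch only, by the following function. Each iteration plays the role of one parent process: it claims the functions marked for its host, adds them to the reserved set, and hands off to the host of the first function that is still unclaimed. The function startup_sequence and the single-letter host labels are hypothetical, and no actual remote process is started.

```python
def startup_sequence(drawing, master):
    """Return [(host, functions_started_there), ...] in startup order.

    `drawing` is an ordered list of (function_name, host) pairs, standing in
    for the host selections recorded in the application drawing.
    """
    order = []
    reserved = set()  # functions already claimed by earlier hosts
    host = master
    while host is not None:
        mine = [f for f, h in drawing if h == host and f not in reserved]
        order.append((host, mine))
        reserved.update(mine)
        # The next parent is started on the host of the first unclaimed function.
        remaining = [h for f, h in drawing if f not in reserved]
        host = remaining[0] if remaining else None
    return order


sequence = startup_sequence(
    [("Read-Data", "A"), ("Transform-Data", "B"), ("Write-Data", "C")],
    master="A",
)
```

When no unclaimed functions remain, the loop ends, which corresponds to the completion of the distributed startup sequence.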

FIG. 14 is a modified UML collaboration diagram (as described above) that extends FIG. 13 to illustrate the process of distributed shared memory replication when the sample application is executed in a distributed environment comprising three hosts.

In a distributed processing scenario, additional control flow is needed to communicate the status of each host. The parent process [1102.1] on the master host A [1301.1] is responsible for directing the entire application across hosts. As a result of this, its distributed shared memory [1101.1] must reflect the state of all child processes on all child hosts. To do this, distributed shared memory is partially replicated from host to host.

Each child process is responsible for updating its portion of the distributed shared memory structure at regular intervals. In the example application, the replication process begins on host C [1301.3] when its update interval arrives. At this point, the parent process [1102.3] writes [1401] its output [1402]. When the update interval for host B [1301.2] is reached, the host B parent process [1102.2] will read [1403] the output from the host C parent process, and update [1404] its distributed shared memory [1101.2] with the control data it read as output from host C. A cumulative update of control data including control data from host B and host C is then written [1405] as output [1406] from the parent process. When the update interval for host A is reached, the host A parent process [1102.1] will read [1407] the output from the host B parent process (which it started), and update [1408] its own host A distributed shared memory [1101.1] with the control data it read as output from host B.
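The cumulative, host-to-host replication described above might be sketched as follows. Dictionaries stand in for the control data held in each host's distributed shared memory; the function replicate, the status strings, and the collapsing of three separate update intervals into a single pass are assumptions for illustration.

```python
def replicate(chain):
    """Model one replication pass from the last host back to the master.

    `chain` is a list of (host, local_status) ordered from the master host to
    the last host started. Returns the control data each host holds after
    merging the cumulative output read from the host it started.
    """
    views = {}
    downstream = {}  # cumulative output written by the next host in the chain
    for host, local in reversed(chain):
        merged = {**downstream, **local}  # downstream data plus own status
        views[host] = merged
        downstream = merged               # written as this host's output
    return views


views = replicate([
    ("A", {"Read-Data": "running"}),
    ("B", {"Transform-Data": "running"}),
    ("C", {"Write-Data": "waiting"}),
])
```

After the pass, the master host's view contains the status of all three functions, while host C's view contains only its own.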

The parent process on the master host periodically reads the contents of the distributed shared memory to obtain information related to each of the child processes. This process occurs at regular intervals, and is timed according to a user-configurable parameter. The information read from the distributed shared memory is used to monitor the progress of the child processes and to provide status updates. Using a distributed shared memory structure to provide status updates allows the child processes to process data in an uninterrupted fashion, without pausing periodically to send status messages. Essentially, this creates a system by which data are transferred from child process to child process “in-band” while status messages and updates are transferred “out-of-band,” separate from the flow of data.

FIG. 15 is a data flow relationship diagram (as described above) that merges FIG. 12 and FIG. 13 to illustrate the flow of data in the data integration engine when the sample application is run in a distributed environment comprising three hosts. The channel data flow method being employed is identical to the single-host/single-instance method described above (FIG. 12), except that the channels now operate to pass data between threads running on different hosts. The communication channels that pass data between threads across hosts can be implemented using any inter-host communication means, including sockets, riiop, rpc, soap, oob, and mpi.

It will be appreciated that the scope of the present invention is not limited to the above-described embodiments, but rather is defined by the appended claims; and that these claims will encompass modifications of and improvements to what has been described.

1. A method of developing data integration applications utilizing semantic identifiers to represent application data fields and variables, the method comprising: a. receiving a set of physical data identifiers, each physical data identifier specifying the name of a physical data field, the network location of the physical data field, and a method of accessing the physical data field; b. storing in a persistent data store a set of semantic names for use in defining data integration applications, wherein each semantic name is associated in the persistent data store with a physical identifier, and wherein each semantic name indicates the meaning of the data contained in the data field specified by said physical identifier; c. defining a data integration application comprising functional rules to extract, transform, and store data, wherein each functional rule is a sequence of programmatic expressions and wherein each programmatic expression comprises semantic names, which are stored in the persistent data store and represent input values, and functional operators to transform and combine said input values to create output values; and d. executing said functional rules by replacing each of said semantic names with data from the data field specified by the physical identifier that is associated in the persistent data store with the semantic name; e. such that a user may define a data integration application utilizing said semantic names and be ignorant of the information contained in the physical identifier.
 2. The method according to claim 1, further comprising automatically converting the input values from one datatype to another as required by the functional operators.
 3. The method according to claim 1, further comprising: a. analyzing the received physical data identifiers; b. using the results of said analysis to provide a set of suggested semantic names for at least some of the physical data identifiers; and c. responsive to user input, selecting a semantic name from the set of suggested semantic names and associating the selected semantic name with a physical identifier in the persistent data store.
 4. A system for developing data integration applications utilizing semantic identifiers to represent application data fields and variables, the system comprising: a. a persistent data store that is operable to store computer-readable data and to store relations among said data; b. logic for receiving a set of physical data identifiers, each data identifier specifying the name of a physical data field, the network location of the physical data field, and a method of accessing the physical data field; c. logic for storing in the persistent data store a set of semantic names for use in defining data integration applications, wherein each semantic name is associated in the persistent data store with a physical identifier, and wherein the semantic name indicates the meaning of the data field specified by said physical identifier; d. logic for defining a data integration application comprising functional rules to extract, transform, and store data, wherein each functional rule is a sequence of programmatic expressions, each programmatic expression comprising semantic names, which are stored in the persistent data store and represent input values, and functional operators to transform and combine said input values to create output values; and e. logic for executing said functional rules by replacing each of said semantic names with data from the data field specified by the physical identifier that is associated with the semantic name in the persistent data store; f. such that a user may define a data integration application utilizing said semantic names and be ignorant of the information contained in the physical identifier.
 5. The system according to claim 4, further comprising logic for automatically converting the input values from one datatype to another as required by the functional operators.
 6. The system according to claim 4, further comprising: a. logic for analyzing the received physical data identifiers; b. logic for using the results of said analysis to provide a set of suggested semantic names for at least some of the physical data identifiers; and c. responsive to user input, selecting a semantic name from the set of suggested semantic names and associating the selected semantic name with a physical identifier in the persistent data store.